NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Temporally Streaming Audio-Visual Synchronization for Real-World Videos

https://doi.org/10.1109/WACV61041.2025.00490

Voas, Jordan; Tseng, Wei-Cheng; Berry, Layne; Hu, Xixi; Peng, Puyuan; Stuedemann, James; Harwath, David (February 2025, IEEE)

Free, publicly-accessible full text available February 26, 2026
Neural Codec Language Models for Disentangled and Textless Voice Conversion

https://doi.org/10.21437/Interspeech.2024-1298

Baade, Alan; Peng, Puyuan; Harwath, David (September 2024, ISCA)

Full Text Available
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

https://doi.org/10.18653/v1/2024.acl-long.673

Peng, Puyuan; Huang, Po-Yao; Li, Shang-Wen; Mohamed, Abdelrahman; Harwath, David (August 2024, Association for Computational Linguistics)

Full Text Available
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Chen, Changan; Peng, Puyuan; Baid, Ami; Xue, Zihui; Hsu, Wei-Ning; Harwath, David; Grauman, Kristen (July 2024, https://doi.org/10.48550/arXiv.2406.09272)

Generating realistic audio for human actions is critical for applications such as film sound effects and virtual reality games. Existing methods assume complete correspondence between video and audio during training, but in real-world settings, many sounds occur off-screen or weakly correspond to visuals, leading to uncontrolled ambient sounds or hallucinations at test time. This paper introduces AV-LDM, a novel ambient-aware audio generation model that disentangles foreground action sounds from ambient background noise in in-the-wild training videos. The approach leverages a retrieval-augmented generation framework to synthesize audio that aligns both semantically and temporally with the visual input. Trained and evaluated on Ego4D and EPIC-KITCHENS datasets, along with the newly introduced Ego4D-Sounds dataset (1.2M curated clips with action-audio correspondence), the model outperforms prior methods, enables controllable ambient sound generation, and shows promise for generalization to synthetic video game clips. This work is the first to emphasize faithful video-to-audio generation focused on observed visual content despite noisy, uncurated training data.
more » « less
Full Text Available
SpeechCLIP+: Self-Supervised Multi-Task Representation Learning for Speech Via Clip and Speech-Image Data

https://doi.org/10.1109/ICASSPW62465.2024.10625827

Wang, Hsuan-Fu; Shih, Yi-Jen; Chang, Heng-Jui; Berry, Layne; Peng, Puyuan; Lee, Hung-Yi; Wang, Hsin-Min; Harwath, David (April 2024, IEEE)

Full Text Available
Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model

https://doi.org/10.1109/ICASSPW62465.2024.10625802

Fang, Hung-Chieh; Ye, Nai-Xuan; Shih, Yi-Jen; Peng, Puyuan; Wang, Hsuan-Fu; Berry, Layne; Lee, Hung-Yi; Harwath, David (April 2024, IEEE)

Full Text Available
Syllable Discovery and Cross Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

Peng, Puyuan; Li, Shang-Wen; Rasanen, Okko; Mohamed, Abdelrahman; Harwath, David (August 2023, Interspeech)

Full Text Available
Audio-Visual Neural Syntax Acquisition

Lai, Cheng-I Jeff; Shi, Freda; Peng, Puyuan; Kim, Yoon; Gimpel, Kevin; Chang, Shiyu; Chuang, Yung-Sung; Bhati, Saurabhchand; Cox, David; Harwath, David; et al (December 2023, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU))

Full Text Available
A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

Peng, Puyuan; Kamper, Herman; Livescu, Karen. (January 2020, Advances in neural information processing systems)

We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation. The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages. Our model, which we refer to as a maximal sampling correspondence variational autoencoder (MCVAE), is a recurrent neural network (RNN) trained with a novel self-supervised correspondence loss that encourages consistency between embeddings of different instances of the same word. Our training scheme improves on previous correspondence training approaches through the use and comparison of multiple samples from the approximate posterior distribution. In the zero-resource setting, the MCVAE can be trained in an unsupervised way, without any ground-truth word pairs, by using the word-like segments discovered via an unsupervised term discovery system. In both this setting and a semi-supervised low-resource setting (with a limited set of ground-truth word pairs), the MCVAE outperforms previous state-of-the-art models, such as Siamese-, CAE- and VAE-based RNNs.
more » « less
Full Text Available

Search for: All records